PERF: New intermediate representation #278

Merged: omerbenamram merged 16 commits into master from ir-parser on Dec 30, 2025

@omerbenamram commented Dec 30, 2025

This consolidates all performance work into a single end‑to‑end refactor:

  • Consolidates all internal representations into a single IR structure.
  • Near-zero allocation: parsing and rendering almost never call malloc anymore.
  • Introduces SIMD-accelerated UTF-16 decoding and escaping (utf16-simd).
  • Uses zmij for formatting floats.
  • Uses jiff for datetimes.
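
To make the UTF-16 escaping bullet concrete, here is a minimal scalar sketch of decoding UTF-16 code units and JSON-escaping them in one pass. It uses only the standard library; `escape_utf16_json` is a hypothetical name and not the utf16-simd API, and a SIMD version would presumably handle runs of plain ASCII in bulk instead of one character at a time.

```rust
use std::char::decode_utf16;
use std::fmt::Write;

/// Decode UTF-16 code units and append them to `out` as JSON-escaped UTF-8.
/// Scalar reference sketch (hypothetical helper, not the utf16-simd API).
fn escape_utf16_json(units: &[u16], out: &mut String) {
    for decoded in decode_utf16(units.iter().copied()) {
        // Unpaired surrogates become U+FFFD instead of aborting the record.
        let c = decoded.unwrap_or(char::REPLACEMENT_CHARACTER);
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                // Remaining control characters use the \u00XX form.
                write!(out, "\\u{:04x}", c as u32).unwrap();
            }
            c => out.push(c),
        }
    }
}
```

Because the function appends to a caller-owned `String`, one buffer can be reused across records, so no intermediate UTF-8 string is allocated before escaping.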

Architecture (before → after) — allocation savings highlighted

BEFORE (streaming output, token pipeline)

  EVTX file
    |
    v
  BinXML streaming decoder
    |  (binxml::tokens + binxml::deserializer)
    v
  Token vectors (Vec<Token>)
    |    + alloc: token list per record
    v
  Template instantiation (legacy)
    |  (binxml::assemble)
    |    + alloc: per-template token expansion
    |    + alloc: repeated name/value copies
    v
  Streaming JSON/XML renderer (token-driven)
    |  (json_stream_output + xml_output)
    |    + alloc: transient strings
    |    + alloc: per-node scratch buffers
    v
  Output

AFTER (IR tree + streaming output)

  EVTX file
    |
    v
  BinXML → IR tree builder
    |  (binxml::ir + binxml::array_expand)
    |    - arena-backed nodes/strings (IrArena)
    |    - template cache reuse
    |    - spec-compliant array expansion in-place
    |    - WEVT fallback on corrupt templates
    v
  IR tree (typed, compact)
    |  (model::ir + model::ir_visit)
    |    - names/strings are arena slices
    |    - no token vectors
    v
  Streaming JSON/XML renderer
    |  (binxml::ir_json + binxml::ir_xml + binxml::value_render)
    |    - no serde_json / no HashMap
    |    - utf16-simd for bulk decode/escape
    |    - sonic-rs / zmij / jiff for fast formatting
    v
  Output
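
The "IR tree, then streaming renderer" stage above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `Node`, `Tree`, `render_json`, and `sample_tree` are hypothetical names, not the crate's `model::ir` types, the sample values are made up, and JSON escaping is left out for brevity.

```rust
use std::ops::Range;

/// Hypothetical, simplified IR node (not the crate's actual `model::ir`).
/// Names and text borrow from a backing buffer instead of owning Strings.
enum Node<'a> {
    Text(&'a str),
    /// `children` is a range of indices into `Tree::nodes`.
    Element { name: &'a str, children: Range<usize> },
}

/// All nodes of one record live in a single Vec, so a record costs one
/// growable buffer instead of one allocation per node.
struct Tree<'a> {
    nodes: Vec<Node<'a>>,
}

/// Append node `id` to `out` as JSON, writing directly into the output
/// buffer without building an intermediate map. Escaping is omitted.
fn render_json(tree: &Tree<'_>, id: usize, out: &mut String) {
    match &tree.nodes[id] {
        Node::Text(text) => {
            out.push('"');
            out.push_str(text);
            out.push('"');
        }
        Node::Element { name, children } => {
            out.push_str("{\"");
            out.push_str(name);
            out.push_str("\":[");
            for (i, child) in children.clone().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                render_json(tree, child, out);
            }
            out.push_str("]}");
        }
    }
}

/// Made-up sample record: <Event><EventID>4624</EventID><Channel>Security</Channel></Event>
fn sample_tree() -> Tree<'static> {
    Tree {
        nodes: vec![
            Node::Element { name: "Event", children: 1..3 },
            Node::Element { name: "EventID", children: 3..4 },
            Node::Element { name: "Channel", children: 4..5 },
            Node::Text("4624"),
            Node::Text("Security"),
        ],
    }
}
```

Calling `render_json(&sample_tree(), 0, &mut out)` walks the tree from the root at index 0 and leaves the JSON text in `out`.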

Allocation savings:

  • eliminate the per-record Vec<Token> token list
  • eliminate per-template expansion buffers
  • reduce string copies via arena slices
  • avoid serde_json/HashMap materialization
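
As an illustration of "string copies via arena slices", the sketch below stores every string of a record in one buffer and hands out small range handles. It is std-only and hypothetical: `StrArena` and `StrRef` are invented for this example, and the actual `IrArena` API (which presumably builds on bumpalo) is not shown in this PR text.

```rust
/// Handle to a string stored in the arena: a byte range, not an owned String.
#[derive(Clone, Copy, Debug, PartialEq)]
struct StrRef {
    start: u32,
    len: u32,
}

/// Hypothetical bump-style string arena backed by a single buffer.
struct StrArena {
    buf: String,
}

impl StrArena {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: String::with_capacity(bytes) }
    }

    /// Copy `s` into the shared buffer once and return its range.
    fn alloc(&mut self, s: &str) -> StrRef {
        let start = self.buf.len() as u32;
        self.buf.push_str(s);
        StrRef { start, len: s.len() as u32 }
    }

    /// Resolve a handle back to a borrowed slice of the buffer.
    fn get(&self, r: StrRef) -> &str {
        &self.buf[r.start as usize..(r.start + r.len) as usize]
    }

    /// Forget all strings but keep the capacity, so the next record
    /// reuses the same memory instead of allocating again.
    fn reset(&mut self) {
        self.buf.clear();
    }
}
```

After the buffer has grown to fit the largest record, `alloc` followed by `reset` per record no longer touches the allocator, which is the effect the savings list describes.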

Performance

  • Machine: Darwin 25.2.0 arm64
  • Sample: samples/security_big_sample.evtx (30MB)
  • Command: evtx_dump -o json
| Threads | master (49ad8ce) | branch (cdcf30a) | Speedup |
|---------|------------------|------------------|---------|
| 1       | 575.8 ± 1.6 ms   | 153.8 ± 0.9 ms   | 3.74×   |
| 8       | 141.9 ± 0.7 ms   | 57.4 ± 0.8 ms    | 2.47×   |
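
The Speedup column is the ratio of the mean wall times (master divided by branch). A small sketch of that arithmetic, with `speedup` as an illustrative helper name:

```rust
/// Speedup of the new build over the old one, given mean wall times in ms.
fn speedup(old_ms: f64, new_ms: f64) -> f64 {
    old_ms / new_ms
}

/// Format a ratio the way the table does, e.g. "3.74×".
fn format_speedup(ratio: f64) -> String {
    format!("{ratio:.2}×")
}
```

With the reported means, `speedup(575.8, 153.8)` is about 3.74 and `speedup(141.9, 57.4)` is about 2.47, matching the table. The same function applied down each column gives the 1 to 8 thread scaling: roughly 4.06× on master and 2.68× on the branch.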

Supersedes (will close after merge)

Breaking changes

  • Timestamp types changed from chrono::DateTime to jiff::Timestamp in:
    • EvtxRecord, EvtxRecordHeader, SerializedEvtxRecord
    • binxml::value_variant::BinXmlValue (FileTimeType, SysTimeType, and array variants)
  • Removed public token/DOM models:
    • model::deserialized (e.g., BinXMLDeserializedTokens, BinXmlTemplateRef, etc.)
    • model::xml (XmlModel, XmlElement, etc.)
    • binxml::deserializer
  • Removed public streaming/output interfaces:
    • JsonOutput, JsonStreamOutput, XmlOutput, BinXmlOutput re‑exports removed
    • EvtxRecord::into_output removed
    • EvtxRecord::into_json_stream removed
    • EvtxParser::records_json_stream removed
  • EvtxRecord public fields changed:
    • tokens: Vec removed
    • New: tree: IrTree, binxml_offset, binxml_size
  • wevt_templates API changes:
    • parse_temp_binxml_fragment / parse_wevt_binxml_fragment removed from exports
    • render_with_substitution_values renamed to render_with_values
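
For callers affected by the timestamp change, the underlying FILETIME arithmetic may be a useful reference. The sketch below is std-only and not taken from the crate: `filetime_to_unix` is a hypothetical helper that converts a Windows FILETIME (100 ns ticks since 1601-01-01 UTC) into Unix seconds and nanoseconds.

```rust
/// Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
const EPOCH_DIFF_SECS: i64 = 11_644_473_600;
/// A FILETIME tick is 100 ns, so there are 10 million ticks per second.
const TICKS_PER_SEC: u64 = 10_000_000;

/// Convert a raw FILETIME value to (unix_seconds, subsecond_nanoseconds).
/// Hypothetical helper for illustration; not the crate's own conversion.
fn filetime_to_unix(filetime: u64) -> (i64, i32) {
    // u64::MAX / 10^7 fits comfortably in i64, so the cast cannot overflow.
    let secs = (filetime / TICKS_PER_SEC) as i64 - EPOCH_DIFF_SECS;
    // At most 9_999_999 ticks remain, i.e. 999_999_900 ns, which fits in i32.
    let nanos = ((filetime % TICKS_PER_SEC) * 100) as i32;
    (secs, nanos)
}
```

Assuming jiff's `Timestamp::new(second, nanosecond)` constructor, the resulting pair could be passed to it to obtain a `jiff::Timestamp`; code that previously consumed `chrono::DateTime` values from `EvtxRecord` would then work with that type instead.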

Note

Major perf-focused refactor to a new IR, with removals of legacy token/streaming code and API changes.

  • Introduces arena-backed IR tree (model::ir) with in-place array expansion (binxml/array_expand.rs); removes binxml/deserializer.rs, binxml/assemble.rs, and token/DOM models
  • Simplifies evtx_dump: single JSON implementation (drops --json-parser and streaming path), updates WEVT rendering to render_*_with_values
  • Replaces chrono with jiff timestamps; adds sonic-rs, zmij, utf16-simd, bumpalo, ahash; updates Cargo features/benches and runs utf16-simd tests in CI
  • Adds new benches/binaries and perf/benchmark scripts; cleans up old compare tool; expands .gitignore for perf/sample dirs
  • Breaking: timestamp types changed; removed public token/streaming APIs and fields; EvtxRecord now exposes IR tree and binxml offsets; wevt_templates API renamed

Written by Cursor Bugbot for commit 2917be0.

@omerbenamram changed the title from "Ir parser" to "PERF: New intermediate representation" on Dec 30, 2025
@omerbenamram force-pushed the ir-parser branch 2 times, most recently from 7a4b019 to 4c6aebf, on December 30, 2025 23:09
@omerbenamram merged commit 61a7fad into master on Dec 30, 2025
4 checks passed